
    On the design of state-of-the-art pseudorandom number generators by means of genetic programming

    Congress on Evolutionary Computation, Portland, USA, 19-23 June 2004. The design of pseudorandom number generators by means of evolutionary computation is a classical problem. To date it has mostly, and most successfully, been tackled with cellular automata, and few proposals, inside or outside this paradigm, can claim to be both robust (passing all the statistical tests, including the most demanding ones) and fast, as is the case of the proposal we present here. Furthermore, to obtain these generators we take a radical approach: our fitness function is not based on any measure of randomness, as is frequently the case in the literature, but on nonlinearity. Efficiency is ensured by using only operators that are very efficient in both hardware and software and by limiting the number of terminals in the genetic programming implementation.
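
    The abstract does not spell out the fitness function beyond saying it is based on nonlinearity rather than randomness. The following self-contained sketch shows one standard way to measure the nonlinearity of a Boolean function, via the fast Walsh-Hadamard transform, which is the kind of quantity such a fitness could build on; the table size, the toy function and all names are illustrative, not the paper's implementation.

        /* Nonlinearity NL(f) = 2^(n-1) - max_a |W_f(a)| / 2,
         * where W_f(a) = sum_x (-1)^(f(x) XOR a.x). Illustrative sketch only. */
        #include <stdio.h>
        #include <stdlib.h>

        int nonlinearity(const unsigned char *truth_table, int n)
        {
            int size = 1 << n;
            long *w = malloc(size * sizeof *w);
            for (int x = 0; x < size; x++)
                w[x] = truth_table[x] ? -1 : 1;      /* sign vector (-1)^f(x) */

            /* In-place fast Walsh-Hadamard transform. */
            for (int len = 1; len < size; len <<= 1)
                for (int i = 0; i < size; i += 2 * len)
                    for (int j = i; j < i + len; j++) {
                        long a = w[j], b = w[j + len];
                        w[j] = a + b;
                        w[j + len] = a - b;
                    }

            long max_abs = 0;
            for (int x = 0; x < size; x++) {
                long v = w[x] < 0 ? -w[x] : w[x];
                if (v > max_abs)
                    max_abs = v;
            }
            free(w);
            return (size >> 1) - (int)(max_abs / 2);
        }

        int main(void)
        {
            /* Toy example: f(x0,x1,x2) = (x0 AND x1) XOR x2, nonlinearity 2. */
            unsigned char f[8];
            for (int x = 0; x < 8; x++)
                f[x] = ((x & 1) & ((x >> 1) & 1)) ^ ((x >> 2) & 1);
            printf("nonlinearity = %d\n", nonlinearity(f, 3));
            return 0;
        }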

    The FNL+MMA Instruction Cache Prefetcher

    When designing a prefetcher, the computer architect has to define which event should trigger a prefetch action and which blocks should be prefetched. We propose to trigger prefetch requests on I-Shadow cache misses. The I-Shadow cache is a small, tag-only cache that monitors only demand misses. FNL+MMA combines two prefetchers that exploit two characteristics of I-cache usage. In many cases, the next line is used by the application in the near future, but systematic next-line prefetching leads to overfetching and cache pollution. The Footprint Next Line prefetcher, FNL, overcomes this difficulty by predicting whether the next line will be used in the "not so long" future. Prefetching up to 5 next lines, FNL achieves a 16.5% speed-up on the championship public traces. If no prefetching is used, the sequence of I-cache misses is partially predictable ahead of time. That is, when block B misses, the nth miss following that miss is often on the same block B(n). This property holds for relatively large n, up to 30. The Multiple Miss Ahead prefetcher, MMA, leverages this property: we predict the nth next miss on the I-Shadow cache and predict whether it will also miss in the overall I-cache. A 96KB FNL+MMA achieves a 28.7% speed-up and decreases the I-cache miss rate by 91.8%.
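
    A minimal sketch of a footprint-next-line style predictor, under assumed table sizes, hashing and a simplified training rule (the actual FNL trains on I-Shadow cache behavior and also needs a way to decay stale footprints, which is omitted here): each entry remembers which of the next few lines tended to be used after a miss on a block, and those lines are prefetched on the next miss on that block.

        #include <stdint.h>

        #define FNL_ENTRIES 1024      /* assumed table size */
        #define FNL_DEGREE  5         /* up to 5 next lines, as in the abstract */

        static uint8_t footprint[FNL_ENTRIES][FNL_DEGREE]; /* 2-bit counters */
        static uint64_t last_miss_block = 0;

        static unsigned fnl_index(uint64_t block) {
            return (unsigned)((block ^ (block >> 10)) & (FNL_ENTRIES - 1));
        }

        /* On an I-Shadow cache miss: return a bitmask of next lines to prefetch. */
        unsigned fnl_predict(uint64_t block) {
            unsigned mask = 0, idx = fnl_index(block);
            for (int d = 0; d < FNL_DEGREE; d++)
                if (footprint[idx][d] >= 2)        /* confident: prefetch block + d + 1 */
                    mask |= 1u << d;
            return mask;
        }

        /* On every demand miss: train the footprint of the previous missing block. */
        void fnl_train(uint64_t block) {
            uint64_t dist = block - last_miss_block;
            if (dist >= 1 && dist <= FNL_DEGREE) {
                unsigned idx = fnl_index(last_miss_block);
                if (footprint[idx][dist - 1] < 3)
                    footprint[idx][dist - 1]++;
            }
            last_miss_block = block;
        }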

    A Phase Change Memory as a Secure Main Memory

    Phase change memory (PCM) technology appears to be more scalable than DRAM technology. As PCM exhibits access times slightly longer than, but in the same range as, DRAM, several recent studies have proposed to use PCM for designing main memory systems. Unfortunately, PCM technology suffers from limited write endurance; typically, each memory cell can only be written a large but still limited number of times (10^7 to 10^9 writes are reported for current technology). Until now, research proposals have essentially focused their attention on designing memory systems that will survive the average behavior of conventional applications. However, PCM memory systems should be designed to survive worst-case applications, i.e., malicious attacks targeting the physical destruction of the memory by overwriting a limited number of memory cells. In this paper, we propose the design of a secure PCM-based main memory that would, by construction, survive overwrite attacks.
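
    A rough back-of-the-envelope illustration of why worst-case behavior matters (the endurance figure is taken from the range quoted above; the 100 ns write interval is an assumed, illustrative value, not from the paper): an attacker able to rewrite the same cell back-to-back would wear it out in seconds.

        time to failure = endurance x write interval = 10^8 writes x 100 ns/write = 10 s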

    Don't Use the Page Number, but a Pointer on It

    Most newly announced microprocessors manipulate 64-bit virtual addresses, and the width of physical addresses is also growing. As a result, the relative size of the address tags in the L1 cache is increasing. This is particularly dramatic when small block sizes are used. At the same time, the performance of complex superscalar processors depends more and more on the accuracy of branch prediction, while the size of the Branch Target Buffer also grows linearly with the address width. In this paper, we apply the very simple principle stated in the title to limit the tag size of on-chip caches and the size of the Branch Target Buffer. In an indirect-tagged cache, the anachronistic duplication of the page number in the processor (in the TLB and in the cache tags) is removed. The tag check is thereby simplified, and the tag cost no longer depends on the address width. Applying the same principle, we then propose the Reduced Branch Target Buffer. The storage size of a Reduced Branch Target Buffer does not depend on the address width and is dramatically smaller than that of a conventional Branch Target Buffer implementation.
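
    A minimal sketch of the indirect-tagging principle, under the assumption that the cache index and block offset fit within the page offset (so a conventional tag would be the full page number); structure and field names are illustrative, not the paper's exact design. The tag shrinks to a pointer into the TLB, so its width depends on the TLB size rather than on the address width.

        #include <stdint.h>
        #include <stdbool.h>

        #define TLB_ENTRIES 64        /* tag shrinks to log2(64) = 6 bits */
        #define CACHE_SETS  256

        typedef struct {
            uint8_t tlb_entry;        /* pointer to the translating TLB entry */
            bool    valid;
        } indirect_tag_t;

        static indirect_tag_t l1_tag[CACHE_SETS];

        /* Hit if the line was installed through the TLB entry that currently
         * translates this access. When a TLB entry is evicted, the lines whose
         * tag points to it must be invalidated (not shown here). */
        bool l1_hit(unsigned set, unsigned hitting_tlb_entry)
        {
            return l1_tag[set].valid && l1_tag[set].tlb_entry == hitting_tlb_entry;
        }

        void l1_fill(unsigned set, unsigned translating_tlb_entry)
        {
            l1_tag[set].tlb_entry = (uint8_t)translating_tlb_entry;
            l1_tag[set].valid = true;
        }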

    Bank-interleaved cache or memory indexing does not require euclidean division

    Concurrent access to bank-interleaved memory structures has been studied for decades, particularly in the context of vector supercomputer systems. It is still a common belief that using a number of banks different from 2^n requires inserting complex hardware, including a non-trivial divider, on the access path to the memory. In 1993, two independent studies [1], [2] showed that, by leveraging a very simple arithmetic result, the Chinese Remainder Theorem, this Euclidean division is not needed when the number of banks is prime or simply odd. In the mid-90's, interest in vector supercomputers faded and the research topic disappeared. Interest in bank-interleaved caches has recently reappeared [3] in the GPU context. In this short paper, we extend the result from [1] and show that, regardless of the number of banks, bank-interleaved cache or memory indexing does not require Euclidean division.
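
    The following self-contained sketch illustrates the Chinese-Remainder-Theorem mapping mentioned above for an odd number of banks (the bank and row counts are arbitrary illustrative values): the bank is the address modulo the number of banks, a cheap small-constant modulo in hardware, and the row is the address modulo the power-of-two number of rows, i.e. just the low-order address bits. Since the two moduli are coprime, every line address in a frame of NBANKS*NROWS lines gets a distinct (bank, row) slot, so no Euclidean division is needed to compute the row.

        #include <stdio.h>
        #include <string.h>

        #define NBANKS 7            /* odd number of banks */
        #define NROWS  1024         /* rows per bank, power of two */

        static inline unsigned bank_of(unsigned long a) { return (unsigned)(a % NBANKS); }
        static inline unsigned row_of(unsigned long a)  { return (unsigned)(a & (NROWS - 1)); }

        int main(void)
        {
            static unsigned char used[NBANKS][NROWS];
            memset(used, 0, sizeof used);

            /* Check that consecutive line addresses map to distinct (bank, row) slots. */
            for (unsigned long a = 0; a < (unsigned long)NBANKS * NROWS; a++) {
                unsigned b = bank_of(a), r = row_of(a);
                if (used[b][r]) {
                    printf("collision at address %lu\n", a);
                    return 1;
                }
                used[b][r] = 1;
            }
            printf("bijective mapping over %d x %d slots, no divider needed for the row\n",
                   NBANKS, NROWS);
            return 0;
        }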

    Yet Another Compressed Cache: a Low Cost Yet Effective Compressed Cache

    Cache memories play a critical role in bridging the latency, bandwidth, and energy gaps between cores and off-chip memory. However, caches frequently consume a significant fraction of a multicore chip's area, and thus account for a significant fraction of its cost. Compression has the potential to improve the effective capacity of a cache, providing the performance and energy benefits of a larger cache while using less area. The design of a compressed cache must address two important issues: i) a low-latency, low-overhead compression algorithm that can represent a fixed-size cache block using fewer bits, and ii) a cache organization that can efficiently store the resulting variable-size compressed blocks. This paper focuses on the latter issue. We propose YACC (Yet Another Compressed Cache), a new compressed cache design that uses super-blocks to reduce tag overheads and variable-size blocks to reduce internal fragmentation, but eliminates two major sources of complexity in previous work: decoupled tag-data mapping and address skewing. YACC's cache layout is similar to that of conventional caches, eliminating the back-pointers used to maintain a decoupled tag-data mapping and the extra decoders used to implement skewed associativity. An additional advantage of YACC is that it enables modern replacement mechanisms, such as RRIP. For our benchmark set, YACC performs comparably to the recently proposed Skewed Compressed Cache (SCC) [Sardashti et al. 2014], but with a simpler, more area-efficient design, without the complexity and overheads of skewing. Compared to a conventional uncompressed 8MB LLC, YACC improves performance by 8% on average and up to 26%, and reduces total energy by 6% on average and up to 20%. An 8MB YACC achieves approximately the same performance and energy improvements as a 16MB conventional cache at a much smaller silicon footprint, with only 1.6% more area than an 8MB conventional cache.
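
    As an illustration of the organization described above (field widths, names and the placement of the RRIP state are assumptions, not the exact YACC layout): one tag entry covers a four-block super-block and its single, conventionally placed data entry, so no back-pointers are needed to locate the data.

        #include <stdint.h>

        #define BLOCKS_PER_SUPERBLOCK 4

        typedef enum { INVALID = 0, COMPRESSED, UNCOMPRESSED } blk_state_t;

        typedef struct {
            uint64_t    superblock_tag;                  /* one tag for 4 neighboring blocks */
            blk_state_t state[BLOCKS_PER_SUPERBLOCK];    /* per-block validity/compression */
            uint8_t     size[BLOCKS_PER_SUPERBLOCK];     /* compressed size, in sub-segments */
            uint8_t     rrpv;                            /* RRIP replacement state */
        } yacc_tag_entry_t;

        /* Unlike decoupled compressed caches, the data entry associated with this
         * tag is found by position (same set and way), so no back-pointer field
         * is required; if the blocks compress well, several neighboring blocks of
         * the super-block share the one data entry. */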

    Selecting Benchmarks Combinations for the Evaluation of Multicore Throughput

    Most high-performance processors today are able to execute multiple threads of execution simultaneously. Threads share processor resources, like the last-level cache, which may decrease throughput in a non-obvious way, depending on the threads' characteristics. Computer architects usually study multiprogrammed workloads by considering a set of benchmarks and some combinations of these benchmarks. Because cycle-accurate microarchitecture simulators are slow, we want a set of combinations that is as small as possible, yet representative. However, there is no standard method for selecting such a sample, and different authors have used different methods. It is not clear how the choice of a particular sample impacts the conclusions of a study. We propose and compare different sampling methods for defining multiprogrammed workloads for computer architecture. We evaluate their effectiveness on a case study: the comparison of several multicore last-level cache replacement policies. We show that random sampling, the simplest method, is robust for defining a representative sample of workloads, provided the sample is large enough. We propose a method for estimating the required sample size based on fast approximate simulation. We propose a new method, workload stratification, which is very effective at reducing the sample size in situations where random sampling would require large samples.
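
    As an illustration of sizing a random sample (this is the standard normal-approximation formula, not necessarily the exact criterion used in the paper; the variability estimate sigma would come from the fast approximate simulation mentioned above): to estimate an average metric within a margin E at 95% confidence, roughly

        n >= (1.96 x sigma / E)^2

    benchmark combinations are needed, which is why high-variability metrics can make random sampling expensive and motivate stratification.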

    Alternative Schemes for High-Bandwidth Instruction Fetching

    Future processors combining out-of-order execution with aggressive speculation techniques will need to fetch multiple non-consecutive instruction blocks in a single cycle to achieve high performance. Several high-bandwidth instruction fetching schemes have been proposed in the past few years. The Two-Block Ahead (TBA) branch predictor predicts two non-consecutive instruction blocks per cycle while relying on a conventional instruction cache. The trace cache (TC) records traces of instructions and delivers multiple non-consecutive instruction blocks to the execution core. The aim of this paper is to investigate the pros and cons of both approaches. Maintaining consistency between memory and the TC is not a straightforward issue; we propose a simple hardware scheme that maintains consistency at a reasonable performance loss (1 to 5%). We also introduce a new fill unit heuristic for the TC, the mispredict hint, which leads to significantly better performance (up to 20%), mainly through better branch prediction accuracy and lower TC miss ratios. TBA requires double-ported or bank-interleaved structures to supply two non-consecutive blocks in a single cycle. We show that a 4-way interleaving scheme is cost-effective, since it impairs performance by only 3 to 5%. Finally, simulation results show that such an enhanced TC scheme delivers higher performance than TBA when caches are large, due to a lower branch misprediction penalty and a higher instruction bandwidth on mispredictions. When the hardware budget is smaller, TBA outperforms the TC because of the TC's higher miss ratio and branch misprediction rate.
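
    A minimal sketch of the bank-conflict condition behind the 4-way interleaved option discussed above (the bank count is from the abstract, the block size is an illustrative value): two non-consecutive blocks can be supplied in the same cycle only when they map to different banks.

        #include <stdint.h>
        #include <stdbool.h>

        #define NBANKS     4
        #define BLOCK_SIZE 16   /* assumed instruction block size in bytes */

        static inline unsigned bank_of(uint64_t addr) {
            return (unsigned)((addr / BLOCK_SIZE) % NBANKS);
        }

        /* Both blocks can be fetched this cycle only if they map to different banks. */
        static inline bool can_fetch_both(uint64_t block_a, uint64_t block_b) {
            return bank_of(block_a) != bank_of(block_b);
        }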

    De-aliased Hybrid Branch Predictors

    Fixed-size branch predictor tables suffer from a loss of prediction accuracy due to aliasing, or interference. This is particularly true for predictors using a global history vector, such as gshare. "De-aliased" global history predictors, namely the skewed branch predictor, the bimode predictor and the agree predictor, were recently proposed. De-aliased predictors consistently achieve the same prediction accuracy level as gshare or gselect using less than half the transistor budget. However, different branches do not require the same vector of information to be accurately predicted. Hybrid predictors, which combine several branch prediction schemes, may deliver higher branch prediction accuracy than a predictor using a single scheme. De-aliased predictors are therefore natural candidates as hybrid predictor components. In this paper, we show how cost-effective hybrid branch predictors can be derived from the enhanced skewed branch predictor e-gskew. 2Bc-gskew combines e-gskew and a bimodal branch predictor. It consists of four identical predictor-table banks, i.e., the three banks from e-gskew, including a bimodal bank, plus a meta-predictor. 2Bc-gskew-pskew combines a bimodal component, a global history register component and a per-address history component. These hybrid predictors are shown to achieve high prediction accuracy at a low hardware cost.
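
    Below is a minimal sketch of a 2Bc-gskew-style prediction path as described above: three banks of 2-bit counters, one of which is the bimodal bank, vote by majority (e-gskew), and a meta-predictor bank selects between the bimodal prediction and the vote. The index hash functions, the table sizes and the omission of the update logic are simplifications, not the actual design.

        #include <stdint.h>
        #include <stdbool.h>

        #define LOG_SIZE 12
        #define ENTRIES  (1 << LOG_SIZE)

        static uint8_t bim[ENTRIES], g0[ENTRIES], g1[ENTRIES], meta[ENTRIES]; /* 2-bit counters */

        static unsigned idx(uint64_t pc, uint64_t hist, int bank) {
            /* Placeholder hashes; the real predictor uses distinct skewing functions. */
            return (unsigned)((pc ^ (hist << bank) ^ (hist >> (LOG_SIZE - bank))) & (ENTRIES - 1));
        }

        bool predict(uint64_t pc, uint64_t ghist) {
            bool p_bim  = bim[pc & (ENTRIES - 1)] >= 2;      /* bimodal: indexed by PC only */
            bool p_g0   = g0[idx(pc, ghist, 1)] >= 2;        /* global-history banks */
            bool p_g1   = g1[idx(pc, ghist, 2)] >= 2;
            bool p_skew = (p_bim + p_g0 + p_g1) >= 2;        /* e-gskew majority vote */
            bool use_skew = meta[idx(pc, ghist, 3)] >= 2;    /* meta-predictor chooses */
            return use_skew ? p_skew : p_bim;
        }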

    Managing SMT Resource Usage through Speculative Instruction Window Weighting

    Simultaneous multithreading processors dynamically share processor resources between multiple threads. In general, shared SMT resources may be managed explicitly, e.g., by dynamically setting queue occupancy bounds for each thread, as in the DCRA and Hill-Climbing policies. Alternatively, resources may be managed implicitly, i.e., resource usage is controlled by placing the desired instruction mix in the resources. In this case, the main resource management tool is the instruction fetch policy, which must predict the behavior of each thread (branch mispredictions, long-latency loads, etc.) as it fetches instructions. In this paper, we present the use of Speculative Instruction Window Weighting (SIWW) to bridge the gap between implicit and explicit SMT fetch policies. SIWW estimates, for each thread, the amount of outstanding work in the processor pipeline; fetch proceeds for the thread with the least work left. SIWW policies are therefore implicit, but they are also explicit, since maximum resource allocations can be set as well. SIWW can use and combine virtually any of the indicators previously proposed for guiding the instruction fetch policy (number of in-flight instructions, number of low-confidence branches, number of predicted cache misses, etc.). Therefore, SIWW is an approach to designing SMT fetch policies rather than a particular fetch policy. Targeting fairness or throughput is often contradictory, and an SMT scheduling policy often optimizes only one performance metric at the expense of the other. Our simulations show that the SIWW fetch policy can simultaneously achieve state-of-the-art throughput, state-of-the-art fairness and state-of-the-art harmonic mean performance.
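
    A minimal sketch of the window-weighting idea, assuming placeholder weights and event categories (the paper's actual indicators and weight values may differ): each thread accumulates a weight per in-flight instruction, heavier for instructions likely to flush or block the pipeline, and the fetch stage picks the thread with the least outstanding weighted work.

        #include <stdint.h>

        #define NTHREADS 4

        enum insn_kind { REGULAR, LOW_CONF_BRANCH, PREDICTED_L2_MISS };

        static const uint32_t weight[] = {
            [REGULAR]           = 1,
            [LOW_CONF_BRANCH]   = 8,    /* assumed: likely to cause a flush */
            [PREDICTED_L2_MISS] = 32,   /* assumed: likely to block the window */
        };

        static uint64_t window_weight[NTHREADS];

        void on_fetch(int tid, enum insn_kind k)  { window_weight[tid] += weight[k]; }
        void on_retire(int tid, enum insn_kind k) { window_weight[tid] -= weight[k]; }

        /* Pick the thread to fetch from this cycle: least speculative work outstanding. */
        int select_fetch_thread(void) {
            int best = 0;
            for (int t = 1; t < NTHREADS; t++)
                if (window_weight[t] < window_weight[best])
                    best = t;
            return best;
        }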